ISMB 2020 Online Workshop and Tutorial Program
ISMB 2020 will hold a series of online workshops and tutorials prior to the start of the ISMB 2020 virtual conference scientific program.
Tutorial Registration is Closed.
- Registration Fees
- Tutorial 1: Mutational signature analysis: pipelines, machine learning, and benchmarking on synthetic data
- Tutorial 2: Finding and analyzing data in the cloud with Gen3, Dockstore, Terra, and Galaxy
- Tutorial 3: Full-Length RNA-Seq Analysis using PacBio long reads: from reads to functional interpretation
- Tutorial 4: A practical introduction to biomedical text mining in the era of deep learning
- Tutorial 5: BioC++ - solving daily bioinformatic tasks with C++ efficiently
- Tutorial 6: Translational use of multifaceted RNA-Seq bioinformatics analysis in genetic disease investigation
- Tutorial 7: Automation of Network Analysis in the Cytoscape Ecosystem
Tutorial 1: Mutational signature analysis: pipelines, machine learning, and benchmarking on synthetic data
Saturday, July 11, 9:00 am - 1:00 pm (Eastern Daylight Time)
Sunday, July 12, 9:00 am - 1:00 pm (Eastern Daylight Time)
Presenters
Steven G. Rozen, Duke-NUS Centre for Computational Biology, Duke-NUS Medical School, Singapore
Arnoud Boot, PhD, Postdoctoral Fellow, Duke-NUS Medical School, Singapore
Ferran Muiños, Institute for Research in Biomedicine Barcelona, The Barcelona Institute of Science and Technology, Barcelona, Spain
Overview
Mutational signature analysis focuses on patterns of mutations across the genome to infer their causes, and is now an essential component of cancer-genomics studies. Over the last decade, mutational signatures have revealed endogenous mutational processes that are widespread in many cancer types but that were not previously known. Signatures also showed that exposure to naturally occurring mutagens that cause liver cancer is much more widespread than suspected. Mutational signature analysis can also provide insight into the causes of specific oncogenic mutations and can reveal gaps in our understanding of the mechanisms of DNA damage and repair. Mutational signatures can either be delineated in experimental systems (e.g. cell culture or rodents) or be discovered by machine learning across sets of hundreds to tens of thousands of tumors. More than 100 mutational signatures have been described, many of which have unknown causes. In line with the importance of mutational signature analysis, there are now ~20 software packages that use machine learning to discover mutational signatures and assess their activity in tumors. Unfortunately, the cancer genomics literature contains numerous erroneous mutational signature results stemming from uncritical application of these packages.
We will cover the basic concepts of mutational signature analysis and show how this analysis is important for understanding cancer development, for detecting mutational exposures that cause cancer, and for understanding DNA damage as processed by normal and defective DNA repair. We will introduce the computational analysis needed to delineate mutational signatures in experimental systems (e.g. cultured cells or rodents), including the computational subtraction of the signatures of background mutagenesis and of experimental artifacts. We will cover in detail machine learning approaches to discovering mutational signatures in large sets of tumors and the strengths and weaknesses of these approaches. We will also discuss in depth the importance of benchmarking the machine-learning approaches on synthetic data. Finally, the tutorial will show examples of the importance of interpreting machine-learning results in the light of all available evidence to obtain biologically relevant results.
This tutorial will equip participants with the ability to run machine-learning software to discover mutational signatures and to assess their activity in tumors and with strategies to evaluate the soundness and biological relevance of the results.
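As a rough illustration of the "patterns of mutations" that signature analysis starts from, the sketch below tabulates single-base substitutions into a mutational spectrum. It assumes the widely used 96-channel convention (six pyrimidine-centred substitution types, each with 16 possible flanking-base contexts), which this page does not itself specify. The function names and the toy mutation list are hypothetical and are not taken from the tutorial materials.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


# Assumed convention: 6 pyrimidine-centred substitutions x 16 flanking contexts.
CHANNELS = [
    f"{five}[{ref}>{alt}]{three}"
    for ref, alts in (("C", "AGT"), ("T", "ACG"))
    for alt in alts
    for five, three in product("ACGT", repeat=2)
]


def channel(context, alt):
    """Map a reference trinucleotide and the mutated middle base to a channel."""
    context, alt = context.upper(), alt.upper()
    if context[1] in "AG":
        # Purine reference base: describe the mutation on the opposite strand.
        context = reverse_complement(context)
        alt = reverse_complement(alt)
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


def spectrum(mutations):
    """Count mutations per channel, in the fixed CHANNELS order."""
    counts = Counter(channel(ctx, alt) for ctx, alt in mutations)
    return [counts[name] for name in CHANNELS]


# Hypothetical toy input: (reference trinucleotide, alternate base) pairs.
toy_mutations = [("ACA", "T"), ("TGT", "A"), ("TCG", "T"), ("CGA", "A")]
toy_spectrum = spectrum(toy_mutations)
```

In this toy input, the second and fourth mutations are strand-flipped versions of the first and third, so they are counted in the same two channels. Spectra of this kind, one per tumor, form the matrix that discovery and attribution methods operate on.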
Learning Objectives
(1) Understand basic concepts of mutational signature analysis. Understand the importance of mutational signature analysis for research into cancer development, for detecting mutational exposures that cause cancer, and for studying how endogenous and exogenous DNA damage as processed by normal and defective DNA repair leads to particular mutational signatures.
(2) Understand computational analysis for delineating mutational signatures in experimental systems, such as cultured cells or rodents, including subtracting signatures of background mutagenesis and of experimental artifacts.
(3) Understand machine learning approaches to discovering mutational signatures in large sets of mutational spectra, plus the opportunities and challenges in using these approaches. Understand available software implementing these approaches. Understand strategies for interpreting the results in the light of all available evidence to resolve unavoidable ambiguities and assess biological relevance.
(4) Understand how testing machine learning methods on synthetic data has revealed the strengths and weaknesses of different approaches.
(5) Understand how processes of DNA damage, repair, and replication interact with genomic landscape.
Audience
Computing experience: there may be exercises using the command line in R; we will also share code snippets written in Python. Currently we hope that most computation can be handled using web servers. Participants will need a basic understanding of genome organization, mutations, and modern high-throughput Illumina-type sequencing (BAM files, variant call files, etc.).
A list of small data sets and possibly software to be pre-downloaded onto students' computers will be available one week before the tutorial.
Maximum Participants: 60
Schedule Overview - Saturday July 11 - 9:00 am - 1:00 pm Eastern Daylight Time
9:00 - 9:05 am | Overview of mutational signatures |
9:05 - 9:25 am | Overview of mutational signatures |
9:25 - 10:30 am | Lecture 1, Arnoud Boot, Mutational signatures and experimental elucidation of mutational signatures
What are mutational spectra and what are they good for |
10:30 (?) | (Hands on) Computational analysis of experimentally delineated mutational signatures; subtracting the signatures of background mutagenesis and of experimental artifacts |
Noon - 1:00 pm | Lecture 2, Steve Rozen, Machine learning for discovering mutational signatures
• The twin problems of signature discovery and determining how much of each signature is present in a tumor (“signature attribution”)
• Signature attribution as a separate problem from signature discovery (using COSMIC and/or experimental signatures)
• Non-negative matrix factorization based approaches
• Challenges in signature discovery and attribution: number of signatures, biological relevance, sparsity versus over-fitting
• Discovery and attribution are not purely algorithmic processes; they require human judgement |
1:00 pm | End for Saturday |
Schedule Overview - Sunday July 12 - 9:00 am - 1:00 pm Eastern Daylight Time
9:00 - 9:30 am | Lecture 2 (continued) Steve Rozen Machine learning for discovering mutational signatures
• Hierarchical Dirichlet process approaches |
9:30 - 10:30 am | Exercises, machine learning / data mining for discovery, assessment with synthetic data |
10:30 - 11:00 am | Lecture 3 - Ferran Muiños - Signatures and Genomic Landscapes: Common Themes and Tactics in Genomic Landscape Analyses
Mutational profiles: from relative frequencies to conditional probabilities. A case for normalization: exome vs whole-genome data. Context of inference vs context of application. |
11:00 - 11:15 am | Break |
11:15 - 1:00 pm | Wrap up discussion, future prospects and challenges, pointers to resources |
1:00 pm | End of the course |
Tutorial 2: Finding and analyzing data in the cloud with Gen3, Dockstore, Terra, and Galaxy
Thursday, July 9, 9:00 am - 1:00 pm (Eastern Daylight Time)
Agenda with Tutorial Materials
Presenters
Geraldine Van der Auwera, Broad Institute of MIT and Harvard, United States
Robert Majovski, Broad Institute of MIT and Harvard, United States
Overview
The era of big data for biomedical research is here. Massive data sets and cloud-based platforms will enable breakthrough discoveries while overcoming challenges of cost, accessibility, and security. A key strength of this new research landscape is interoperable, community-driven components that enable robust analyses for a variety of research needs.
Audience
Researchers and bioinformaticians interested in ways to maximize data and analysis resources in the cloud. The ideal tutorial participant will have coding experience and basic familiarity with genomics terminology and standard high-throughput sequencing data formats.
Goals
Guide you through the capabilities and components of the NHGRI Genomic Data Science Analysis, Visualization and Informatics Lab-space (AnVIL) resource. Gain working knowledge of how the components work together to perform an end-to-end genetic analysis.
Slack: Please join us in the #ismb-2020 channel at https://anvilproject.org/contact
Virtual Event Agenda, all times ET
- 9:00 AnVIL: A new vision for Analysis in the Cloud (Presenter: Mike Schatz)
- 9:15 Data that’s better, bigger, faster in the AnVIL (Presenter: Liz Kiernan)
- 9:25 Intro to Terra Overview (Presenter: Tiffany Miller)
- 9:40 Get set up in Terra (hands-on) (Presenter: Allie Hajian)
- 9:55 Break
- 10:10 Data and documentation in a Workspace (Presenter: Tiffany Miller)
- 10:30 Find and import workflows in Dockstore (hands-on) (Presenter: Tiffany Miller)
- 10:45 Set up and run your workflow (hands-on) (Presenter: Tiffany Miller)
- 11:00 Break
- 11:20 Workflow outputs and troubleshooting (Presenter: Jason Cerrato)
- 11:30 Interactive analysis (hands-on) plus Hail intro (Presenter: Allie Hajian)
- 11:50 Break
- 12:05 Bioconductor for RNA-seq analysis (hands-on) (Presenter: Liz Kiernan)
- 12:50 Wrap-up / Q&A (Presenter: Mike Schatz)
Tutorial 3: Full-Length RNA-Seq Analysis using PacBio long reads: from reads to functional interpretation
Sunday, July 12, 9:00 am - 1:00 pm (Eastern Daylight Time)
Presenters
Ana Conesa, University of Florida, United States
Elizabeth Tseng, Pacific Biosciences, United States
Angeles Arzalluz, Polytechnical University Valencia, Spain
Francisco Pardo, Polytechnical University Valencia, Spain
Carmen Guarco, Pacific Biosciences, United States
Overview
The PacBio Single-Molecule Real-Time sequencing technology produces highly accurate long reads that are suitable for full-length RNA sequencing. The Iso-Seq method generates full-length transcript sequences of 10 kb or longer that do not require transcript assembly or error correction. The high accuracy (>99%) of Iso-Seq transcripts allows for unambiguous characterization of alternative splicing events, direct ORF prediction without a reference genome, and identification of single cell barcodes.
The unique features of Iso-Seq data require a special set of bioinformatics tools that typical short read RNA-seq tools fail to provide. The PacBio SMRT Analysis software processes raw sequencing data into full-length transcript sequences, which can then be analyzed with community tools that have been developed specifically for long read data: SQANTI compares Iso-Seq transcripts against known annotations (e.g. GENCODE) to classify novel vs known genes and transcripts and to remove artifacts; IsoAnnot functionally annotates Iso-Seq transcripts; tappAS compares multiple Iso-Seq samples to identify differential features. Existing RNA-Seq short read data are often paired with Iso-Seq data to strengthen the analysis.
Further, the Iso-Seq method can also be applied to single cell analysis. Matching single cell libraries of both long and short read data can be generated and combined: the deeper coverage of the short reads is used to identify cell types, while matching cell barcodes link the full-length isoforms generated from the long-read data back to individual cell types.
In this tutorial, we provide an overview of the Iso-Seq tools for both bulk and single cell RNA-Seq analysis and guide the audience through hands-on analyses.
Audience
Beginner or intermediate. This tutorial will be of broad interest to researchers from academia or industry who want to learn to understand the unique features and tool sets of long read RNA sequencing (Iso-Seq) data using PacBio’s SMRT Technology.
Attendees are expected to have basic Unix command line skills and some familiarity with R/RStudio. Programming knowledge is not required, though most of the tools are written in Python.
Maximum Audience: 30
Requirements
Attendees are expected to bring their own laptops and have installed R/RStudio and the tappAS software. We will be using a shared instance in AWS for the first part of the analysis (Iso-Seq and SQANTI), then running tappAS on the local laptops.
Schedule Overview
9:00 - 9:30 am | Introduction |
9:30 - 10:15 am | Demo & Hands-On Session: Iso-Seq using BioConda |
10:15 - 11:00 am | Demo & Hands-On Session: Functional analysis of Iso-Seq data |
11:00 - 11:15 am | Break |
11:15 - 11:45 am | Single Cell Iso-Seq |
12:15 - 12:45 pm | Hands-On Session: Single Cell Iso-Seq + RNA-Seq |
12:50 - 1:00 pm | Wrap Up |
Tutorial 4: A practical introduction to biomedical text mining in the era of deep learning
Presenters
Qingyu Chen, National Library of Medicine, National Institutes of Health
Robert Leaman, National Library of Medicine, National Institutes of Health
Cecilia Arighi, Delaware Biotechnology Institute, University of Delaware
Zhiyong Lu, National Library of Medicine, National Institutes of Health
Overview
The volume of biomedical literature is growing at an exponential rate. PubMed, a biomedical literature search engine managed by the National Library of Medicine, has ~2 new articles indexed per minute. Such rapid growth challenges manual information extraction, curation and annotation. Biomedical text mining aims to apply natural language processing techniques to biomedical literature and automatically assist biocurators, biologists and health professionals to overcome the burden. Biomedical text mining has matured significantly in recent years. More specifically, deep learning – end-to-end neural networks inspired by biological systems – has achieved state-of-the-art performance in a range of biomedical text mining applications. In the bioinformatics community, the use of text mining via deep learning to support other research in the biological and medical sciences has been increasing. Not restricted to standalone tools, deep learning models have also been fully deployed to public web servers, further improving the quality of biomedical text mining tools and lowering the barriers for non-specialists.
This tutorial aims to introduce the audience to text mining the biomedical literature using deep learning methods and to provide hands-on training. The tutorial will address questions such as “What is biomedical text mining?”, “What is deep learning?”, “How can deep learning be applied to address biomedical text mining problems?”, and “What biomedical text mining tools are currently available?”. The tutorial will cover the basics of biomedical text mining and deep learning with concrete examples. The latest deep learning methods in biomedical text mining will also be explained and discussed. The audience will also have the opportunity to gain first hands-on experience in developing their own deep learning models for biomedical literature analysis. Topics include:
- Fundamentals of biomedical text mining and literature mining
- Overview of deep learning in biomedical text mining
- Word, sentence, concept embeddings for biomedical textual analysis
- Public biomedical text mining tools for biomedical information retrieval and extraction
- Case studies: biomedical literature analysis
This tutorial is an activity of the ISCB COSI on Text Mining.
Audience
The tutorial is intended for participants who are not text mining specialists but who use text mining or are interested in using it. It will provide a brief introduction, including a description of existing tools and datasets. In addition, the session will give participants an opportunity to describe their needs to text mining specialists.
Maximum Audience: 60
Requirements
None, if participants just wish to listen. Those who would like to also participate in the hands-on exercises are required to provide their own laptop and should have a basic knowledge of programming in Python.
Schedule Overview
9:00 - 9:25 am | Introduction to biomedical text mining |
9:25 - 9:30 am | Short Break |
9:30 - 9:55 am | Introduction to deep learning |
9:55 - 10:00 am | Short Break |
10:00 - 11:00 am | Biomedical language models |
11:00 - 11:15 am | Long Break |
11:15 - 12:00 pm | S4. Demonstration: deep learning tools and datasets for biomedical text mining tasks |
Tutorial 5: BioC++ - solving daily bioinformatic tasks with C++ efficiently
Sunday, July 12, 9:00 am - 1:00 pm (Eastern Daylight Time)
Presenters
René Rahn, Max Planck Institute for Molecular Genetics, Algorithmic Bioinformatics, Germany
Svenja Mehringer, Free University Berlin, Algorithmic Bioinformatics, Germany
Marcel Ehrhardt, Free University Berlin, Algorithmic Bioinformatics, Germany
Overview
In this half-day tutorial we are going to teach how to use modern C++ and utilise modern C++ libraries to rapidly develop tools and scripts for operating on and manipulating large-scale sequencing data.
Motivation
The high variability and heterogeneity often observed within genomic data are challenging for many standard tools, for example for read alignment and variant calling. Often, these tools are wrapped in complicated pre- and postprocessing data curation steps in order to obtain higher-quality results. However, these additional steps impose a high maintenance and performance burden on the established work process and often do not scale to larger data sets. C++ is seldom considered as the language of choice for these small processes, although it is the main language used in high-performance computing. We are going to show that programming in modern C++ can be as easy as using other modern high-level languages.
Course outline:
This tutorial is organised as a half-day tutorial. At the beginning we are going to introduce fundamental concepts and principles of the C++ programming language. Further, we will teach how modern C++ features such as ranges and concepts can be used to rapidly develop high-quality C++ applications. This introduction to C++ is followed by a practical session where participants will read in typical files from sequencing experiments using the C++ library SeqAn and operate on the data with the taught principles to solve diverse problems, e.g. filtering out reads with low sequencing quality. In the last 30 minutes of the day we are going to summarise the learned concepts and compare the developed methods to current approaches.
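To make the example task concrete, the following is a minimal, language-neutral sketch of filtering out reads with low sequencing quality. It is written in Python for brevity and does not use or imitate the SeqAn API taught in the tutorial. It assumes four-line FASTQ records with Phred+33 quality encoding; the function names and the mean-quality threshold of 20 are hypothetical choices for illustration.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality string) from 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def mean_phred(quality, offset=33):
    """Average Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_reads(records, min_mean_quality=20.0):
    """Keep reads whose mean quality reaches the (hypothetical) threshold."""
    return [
        record
        for record in records
        if record[2] and mean_phred(record[2]) >= min_mean_quality
    ]


# Two toy reads: 'I' encodes Phred 40 and '!' encodes Phred 0.
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGT\n+\n!!!!\n"
)
kept = filter_reads(read_fastq(example))
```

Here only `read1` passes the filter. A C++ version built on ranges would express the same idea as a lazy filter over the input records rather than building a list.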
Audience
This tutorial is best suited to computational biologists and bioinformaticians with a research focus on sequence analysis (e.g., genomics, metagenomics, proteomics, read alignment, variant detection, etc.). A fundamental knowledge of sequencing experiments and the data involved is required. We expect attendees to have intermediate programming knowledge in any high-level programming language, e.g. Python, Java or C++. Some basic C++ knowledge is helpful but not mandatory to successfully complete the course.
This tutorial targets beginner and intermediate C++ developers who want to learn more about modern C++ features like ranges and concepts.
Requirements:
Attendees should bring their own laptop.
Software for the tutorial can be installed beforehand, but we will also dedicate some extra time for installing required software during the tutorial.
- Git
- g++ >= 7
- SeqAn 3 - (https://github.com/seqan/seqan3)
- CMake >= 3.12
or, VirtualBox if the attendee wishes to use the provided virtual image running Ubuntu.
Maximum Attendees: 30
Schedule Overview
9:00 - 10:30 am | Introduction to modern C++ [talk: 30 min] Initial app and parsing sequencing data [hands-on: 60 min] |
10:30 - 11:00 am | Break |
11:00 - 12:30 pm | Filtering and data manipulation (hands-on) |
12:30 - 1:00 pm | Wrap-up [talk: 30 min] |
Tutorial 6: Translational use of multifaceted RNA-Seq bioinformatics analysis in genetic disease investigation
Sunday, July 12, 9:00 am - 1:00 pm (Eastern Daylight Time)
Presenters
Gavin R. Oliver, Center for Individualized Medicine, Mayo Clinic, United States
Garrett Jenkinson, Mayo Clinic, United States
Eric W. Klee, PhD, Center for Individualized Medicine, Mayo Clinic, United States
Overview
RNA-Seq is increasingly being recognized as a testing modality with significant untapped potential in the field of genetic disease studies. These data present a unique opportunity for diverse multifaceted analysis. Data profiling methods including expression outlier analysis, aberrant splicing detection, fusion transcript identification and allele-specific expression have been demonstrated to achieve genetic diagnosis of diseases escaping resolution through traditional clinical and research-based DNA testing. Recently published works have highlighted the ability to increase diagnostic rates by as much as 35% utilizing RNA-Seq analysis, but analytical workflows are diverse and non-trivial to implement or interpret. This tutorial focuses on the utilization of RNA-based analysis for the improved diagnosis of rare genetic disease. An introduction will be given to the current state of genetic disease diagnostics and the benefits revealed to date by RNA-Seq. RNA-based testing paradigms will be introduced individually and discussed in terms of translational utility, with a focus on data analysis methodologies and considerations. Each computational analysis approach will be summarized, with hands-on sessions highlighting the analytical capabilities of a specific informatics solution for each testing paradigm. Means of prioritizing results based on biological and phenotypic relevance will be addressed and cutting-edge computational solutions demonstrated. Finally, we will address the principles underlying final data integration, review and analysis to maximize the likelihood of patient diagnosis amidst a growing data deluge.
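The idea behind expression outlier analysis can be sketched with a simple per-gene z-score on log-transformed counts. This is a deliberately simplified stand-in and not the method used by OUTRIDER or any other tool covered in the tutorial. The gene names, sample names, counts and the cutoff of 2 are all hypothetical.

```python
import math
import statistics


def expression_outliers(counts, samples, z_cutoff=2.0):
    """Flag (gene, sample, z) where a sample's log2 expression deviates strongly.

    counts maps each gene to a list of read counts, one per sample.
    The cutoff is a hypothetical value chosen for this toy example.
    """
    flagged = []
    for gene, values in counts.items():
        logged = [math.log2(value + 1) for value in values]
        mean = statistics.mean(logged)
        spread = statistics.stdev(logged)
        if spread == 0:
            continue  # no variation across samples, nothing to flag
        for sample, value in zip(samples, logged):
            z = (value - mean) / spread
            if abs(z) > z_cutoff:
                flagged.append((gene, sample, z))
    return flagged


# Hypothetical cohort of eight samples; P8 has very low GENE_A expression.
samples = ["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"]
counts = {
    "GENE_A": [100, 110, 95, 105, 98, 102, 101, 5],
    "GENE_B": [50, 52, 48, 51, 49, 50, 53, 47],
}
outliers = expression_outliers(counts, samples)
```

In this toy cohort only the low `GENE_A` value in sample `P8` is flagged. A plain z-score ignores confounding variables and count-based noise, which is why confounder correction is treated as a separate step in the schedule.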
Audience
Researchers or scientists with computational or genomics training and an interest in analytical techniques aimed at the improved diagnosis of rare genetic disease. Individuals with prior and current experience in the field of rare genetic disease will benefit from the ability to utilize the knowledge gained immediately in their own work. Experience programming in R would be useful, but practical sessions will be conducted within a Jupyter environment, enabling code to be followed and executed without programming expertise. Attendees wishing to perform the practical components of the hands-on sessions are required to provide their own laptop.
Maximum Audience: 40
Schedule Overview
9:00 - 9:30 am | Introduction |
9:30 - 10:00 am | Confounding variable correction and outlier expression analysis |
10:00 - 10:35 am | Hands-On Practicum: OUTRIDER expression analyses |
10:35 - 11:05 am | Fusion transcript detection in rare genetic disease |
11:05 - 11:20 am | Break |
11:20 - 11:55 am | Hands-On Practicum: Fusion filtering and prioritization |
11:55 - 12:25 pm | Identification of aberrant splicing events in rare genetic disease patients |
12:25 - 1:00 pm | Hands-On Practicum: Leafcutter for detecting splicing outliers |
Tutorial 7: Automation of Network Analysis in the Cytoscape Ecosystem
Sunday, July 12, 9:00 am - 1:00 pm (Eastern Daylight Time)
Tutorial 7 Materials
Presenters
Dexter Pratt, UC San Diego School of Medicine, United States
Alexander Pico, Gladstone Institutes, United States
John “Scooter” Morris, UCSF, United States
Overview
Cytoscape, one of the most popular tools for network analysis and visualization, is evolving into an ecosystem of web applications and cloud services integrated with the original desktop application. In this workshop, we will demonstrate new workflows involving core components of the ecosystem and the methods by which they can be automated for integration with your scripts, web applications, and Cytoscape desktop apps. The workflows will use ecosystem components including the Cytoscape desktop, the NDEx public database, the new Cytoscape Integrated Query application (IQuery), and libraries in R, Python, and Javascript. We will begin with an overview of the ecosystem, and then discuss how its components can be applied to two common tasks: the analysis of molecular interaction data and the interpretation of gene sets. The bulk of the workshop will be a hands-on demonstration of how to use standard components in each programming environment in a workflow involving protein interaction data.
Audience
This tutorial is intended for an audience that has prior experience with:
- R or Python
- Basic Javascript
- The Cytoscape desktop application
- Bioinformatics analysis using R or Python
Participants are required to bring a laptop with the following installed: Cytoscape 3.8, either R and RStudio or Python 3.5+ and Jupyter notebooks, and an environment for web / Javascript development. The Chrome or Edge browsers are preferred. Detailed instructions will be provided in the weeks prior to the tutorial.
Maximum Audience: 60
Schedule Overview
9:00 - 9:40 am | Introduction |
9:40 - 10:20 am | Setting up the Workspace |
10:20 - 11:00 am | Network I/O to NDEx and Basic Visualization |
11:00 - 11:15 am | Break |
11:15 - 12:00 pm | Data to Networks |
12:00 - 1:00 pm | Additional Topics and Q&A |
Registration Fees
ISCB MEMBER FEES - Virtual Tutorials |
Registration Category | High Income Countries | Middle-Low Income Countries | Low Income Countries |
Student (Tutorial 1) | $100.00 | $50.00 | $20.00 |
Post Doc (Tutorial 1) | $100.00 | $50.00 | $20.00 |
Professional: Academic; Non-profit; Government; or Corporate (Tutorial 1) | $100.00 | $50.00 | $20.00 |
Student (Tutorials 2 - 7) | $50.00 | $25.00 | $10.00 |
Post Doc (Tutorials 2 - 7) | $50.00 | $25.00 | $10.00 |
Professional: Academic; Non-profit; Government; or Corporate (Tutorials 2 - 7) | $50.00 | $25.00 | $10.00 |
NON-MEMBER FEES - Virtual Tutorials (fee includes 1 year ISCB membership) |
Registration Category | High Income Countries | Middle-Low Income Countries | Low Income Countries |
Student (Tutorial 1) | $165.00 | $85.00 | $35.00 |
Post Doc (Tutorial 1) | $195.00 | $85.00 | $35.00 |
Professional: Academic; Non-profit; Government; or Corporate (Tutorial 1) | $240.00 | $105.00 | $55.00 |
Student (Tutorials 2 - 7) | $110.00 | $55.00 | $25.00 |
Post Doc (Tutorials 2 - 7) | $140.00 | $55.00 | $25.00 |
Professional: Academic; Non-profit; Government; or Corporate (Tutorials 2 - 7) | $185.00 | $75.00 | $40.00 |
Tutorial 1 will be held on two mornings.
Tutorials 2 - 7 will be held on one morning.
All times Eastern Daylight Time.